Low-Density Language Bootstrapping: the Case of Tajiki Persian

Authors

Karine Megerdoomian

Dan Parvaz

Abstract

Low-density languages raise difficulties for standard approaches to natural language processing that depend on large online corpora. Using Persian as a case study, we propose a novel method for bootstrapping MT capability for a low-density language in the case where it relates to a higher density variant. Tajiki Persian is a low-density language that uses the Cyrillic alphabet, while Iranian Persian (Farsi) is written in an extended version of the Arabic script and has many computational resources available. Despite the orthographic differences, the two languages have literary written forms that are almost identical. The paper describes the development of a comprehensive finite-state transducer that converts Tajik text to Farsi script and runs the resulting transliterated document through an existing Persian-to-English MT system. Due to divergences that arise in mapping the two writing systems and phonological and lexical distinctions, the system uses contextual cues (such as the position of a phoneme in a word) as well as available Farsi resources (such as a morphological analyzer to deal with differences in the affixal structures and a lexicon to disambiguate the analyses) to control the potential combinatorial explosion. The results point to a valuable strategy for the rapid prototyping of MT packages for languages of similar uneven density.
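The core idea in the abstract, over-generating Farsi-script candidates for a Tajik word and then pruning them with positional cues and a Farsi lexicon, can be illustrated with a minimal sketch. This is not the authors' finite-state transducer: it is plain Python enumeration standing in for one, the mapping table covers only a handful of letters, the lexicon is a three-word toy, and the names `options`, `candidates`, and `transliterate` are hypothetical. The morphological analyzer mentioned in the abstract is omitted entirely.

```python
from itertools import product

# Partial, illustrative table (an assumption, not the paper's rule set):
# one Tajik Cyrillic consonant can correspond to several Perso-Arabic letters.
CONSONANTS = {
    "к": ["ک"],
    "т": ["ت", "ط"],
    "б": ["ب"],
    "с": ["س", "ص", "ث"],
    "л": ["ل"],
}

# Toy stand-in for an existing Farsi lexicon: book, year, water.
LEXICON = {"کتاب", "سال", "آب"}


def options(letter, index):
    """Return candidate spellings for one letter, using word position as a cue."""
    initial = index == 0
    if letter == "о":
        # Long /o/: alef with madda word-initially, plain alef elsewhere.
        return ["آ"] if initial else ["ا"]
    if letter == "и":
        # /i/ may be unwritten (short) or written as ye (long) inside a word.
        return ["ا", "ای"] if initial else ["", "ی"]
    # Letters missing from the toy table are passed through unchanged.
    return CONSONANTS.get(letter, [letter])


def candidates(word):
    """Enumerate every spelling licensed by the per-letter options."""
    per_letter = [options(ch, i) for i, ch in enumerate(word)]
    return ["".join(parts) for parts in product(*per_letter)]


def transliterate(word, lexicon=LEXICON):
    """Keep only the candidates that the lexicon attests."""
    return [c for c in candidates(word) if c in lexicon]


if __name__ == "__main__":
    for tajik in ["китоб", "сол", "об"]:
        print(tajik, len(candidates(tajik)), transliterate(tajik))
```

Even in this toy setting the candidate set grows multiplicatively with each ambiguous letter, which is why the lexicon filter is needed to single out one attested spelling per word.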

Similar resources

Design and implementation of Persian spelling detection and correction system based on Semantic

Persian Language has a special feature (grapheme, homophone, and multi-shape clinging characters) in electronic devices. Furthermore, design and implementation of NLP tools for Persian are more challenging than other languages (e.g. English or German). Spelling tools are used widely for editing user texts like emails and text in editors. Also developing Persian tools will provide Persian progr...

Feasibility of Automatically Bootstrapping a Persian WordNet

In this paper we describe a proof-of-concept for the bootstrapping of a Persian WordNet. This effort was motivated by previous work done at Stanford University on bootstrapping an Arabic WordNet using a parallel corpus and an English WordNet. The principle of that work is based on the premise that paradigmatic relations are by nature deeply semantic, and as such, are likely to remain intact bet...

Linguistic Issues in Language Technology LiLT

This paper presents an ongoing project whose goal is to create a freely available dependency treebank for Persian. The data is taken from the Bijankhan corpus, which is already annotated for parts of speech, and a syntactic dependency annotation based on the Stanford Typed Dependencies is added through a bootstrapping procedure involving the open-source dependency parser MaltParser. We report pr...

Linguistic Issues in Language Technology LiLT

In this paper, we describe ongoing research to develop an HPSG-based treebank for Persian. To this aim, we use a bootstrapping approach for the data annotation. In the first step, a set of seed rules is defined as regular expressions in the CLaRK system. Then, the data is shallow processed with this set of rules. In the next step, a human annotator completes the annotation of sentences manually...

Management of cohesion in the written productions of monolingual Persian-speaking students with specific language disorder

Introduction: Students with specific language impairment (SLI) have many difficulties in producing coherent written texts. The goal of this study was to investigate and compare the management of cohesion in the written production of individuals with SLI and their normal peers in terms of density and diversity of connectives, the density of punctuation marks (periods and commas) and density and d...

Publication date: 2008
